Abstract

Conventional statistics begins with a model, and assigns a likelihood of
obtaining any particular set of data. The opposite approach, beginning with the
data and assigning a likelihood to any particular model, is explored here for
the case of points drawn randomly from a continuous probability distribution.
A scalar field theory is used to assign a likelihood over the space of probability
distributions. The most likely distribution may be calculated, providing an
estimate of the underlying distribution and a convenient graphical
representation of the raw data. Fluctuations around this maximum likelihood estimate
are characterized by a robust measure of goodness-of-fit. Its distribution may
be calculated by integrating over fluctuations. The resulting method of data
analysis has some advantages over conventional approaches.



When the outcome of an experiment falls into one of a few categories, the frequency of 
a particular outcome is an estimate of its probability. For example, by repeatedly flipping 
a coin we learn about the probability of obtaining heads. But when the outcome of an 
experiment is one of a continuum, no finite set of data can determine the frequency of 
each outcome. One common method of estimating the underlying probability distribution 
is to group observations into categories, a procedure known as "binning." The histogram 
(the frequency of observations in each bin) is then used as an estimate of the underlying 
probability distribution. While binning is widely used, it has a number of undesirable 
consequences. It requires a choice of bins (both their number and sizes), and different 
choices lead to different histograms. Thus even the appearance of raw data, when presented 
in graphical format, depends on arbitrary choices. Binning also throws information away, 
since different outcomes are grouped together. 
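The dependence on the choice of bins is easy to see on a toy example. The sketch below uses a hypothetical six-point data set (not taken from this work), chosen so that the bin counts can be checked by hand, and NumPy's `numpy.histogram` to build the normalized histogram with two and with four equal bins.

```python
import numpy as np

# Hypothetical six-point data set, chosen so the bin counts are easy to verify by hand.
data = np.array([0.0, 0.9, 1.1, 2.0, 3.9, 4.0])

def histogram_density(samples, n_bins, value_range):
    """Return the normalized histogram (density per bin) and the bin edges."""
    density, edges = np.histogram(samples, bins=n_bins, range=value_range, density=True)
    return density, edges

coarse, _ = histogram_density(data, 2, (0.0, 4.0))
fine, _ = histogram_density(data, 4, (0.0, 4.0))

print(coarse)  # two bins: both densities are equal, so the data look uniform
print(fine)    # four bins: the outer bins are higher, so the same data look U-shaped
```

With two bins each half of the range holds three points, while with four bins the counts are 2, 1, 1, 2, so the apparent shape of the same raw data changes with the binning.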

An alternative approach has been presented [?] to estimate the probability distribution.
These authors assign a likelihood $P[Q|x_1,\ldots,x_N]$ that the distribution $Q(x)$ describes the
data $x_1,\ldots,x_N$. The underlying distribution might then be estimated as the one which
maximizes $P[Q|x_1,\ldots,x_N]$. By Bayes' rule,

$$P[Q|x_1,\ldots,x_N] = \frac{P[x_1,\ldots,x_N|Q]\,P[Q]}{P(x_1,\ldots,x_N)} = \frac{Q(x_1)\cdots Q(x_N)\,P[Q]}{\int [dQ]\,Q(x_1)\cdots Q(x_N)\,P[Q]}, \tag{1}$$

where $P[Q]$ is some a priori likelihood of the distribution $Q$. As no finite set of data can
specify an arbitrary function of a continuous variable, a choice for $P[Q]$ is necessary to
regularize the inverse problem. This choice encapsulates our biases in an explicit fashion.
(These biases are implicit in other approaches, e.g., in our interpretation of a histogram.)

What form should $P[Q]$ have? By setting $Q(x) = \psi^2(x)$ [?], where $\psi$ may take any value
in $(-\infty, \infty)$, we may ensure that $Q$ is non-negative. $\psi$ will be referred to as the amplitude
by analogy with quantum mechanics. $P[Q]$ should incorporate our bias that $Q$ be "smooth"
[?]. "Smoothness" is enforced by penalizing large gradients in $Q$, or rather, in $\psi$. Finally,
$Q$ should be normalized. In one dimension, the a priori distribution is

$$P[\psi] = \frac{1}{Z}\exp\left[-\int dx\,\frac{\ell^2}{2}(\partial_x\psi)^2\right]\delta\left(1-\int dx\,\psi^2\right), \tag{2}$$

where $Z$ is the normalization factor and $\ell$ is a constant which controls the penalty applied
to gradients. The delta function enforces normalization of the distribution $Q$.

The probability $P[Q|x_1,\ldots,x_N]$ of a distribution $Q$, given the data, is therefore

$$P[\psi|x_1,\ldots,x_N] \propto e^{-S[\psi]}\,\delta\left(1-\int dx\,\psi^2\right), \tag{3}$$

where the effective action $S$ is

$$S[\psi] = \int dx\,\frac{\ell^2}{2}(\partial_x\psi)^2 - 2\sum_{i=1}^{N}\ln\psi(x_i). \tag{4}$$

What is the most likely distribution (amplitude), given the data? From Eq. (3), this is
the $\psi$ which minimizes the action, subject to the normalization constraint. This $\psi$ will be
called the classical amplitude, $\psi_{cl}$. To handle the normalization constraint, we subtract a
Lagrange multiplier term $\lambda(1 - \int dx\,\psi^2)$ from the action; $\psi_{cl}$ satisfies the equations

$$-\ell^2\partial_x^2\psi_{cl} + 2\lambda\psi_{cl} - \frac{2}{\psi_{cl}}\sum_i\delta(x-x_i) = 0, \tag{5}$$

$$\int dx\,\psi_{cl}^2 = 1. \tag{6}$$

The solution to these equations may be written

$$\psi_{cl}(x) = \sqrt{\kappa}\sum_i a_i\,e^{-\kappa|x-x_i|}, \tag{7}$$

where $\kappa^2 = 2\lambda/\ell^2$. Each data point therefore contributes one peak of width $1/\kappa$ to the
amplitude $\psi_{cl}$. This is reminiscent of kernel estimation [?], using the amplitude rather than
the probability distribution. Eqs. (5)-(7) imply

$$2\lambda\,a_i\sum_j a_j\,e^{-\kappa|x_i-x_j|} = 1, \qquad i = 1,\ldots,N, \tag{8}$$

$$\sum_{i,j} a_i a_j\,(1+\kappa|x_i-x_j|)\,e^{-\kappa|x_i-x_j|} = 1. \tag{9}$$

These $N + 1$ equations determine $\lambda$ and the $a_i$ as functions of $\kappa$ [?].

FIG. 1. The classical action, Eq. (10), as a function of $\ln\kappa$ for data drawn randomly from
a gaussian distribution with zero mean and unit variance. Long dash, $N = 2000$; short dash,
$N = 200$; dots, $N = 20$.
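A minimal numerical sketch of this step follows. It assumes the exponential-kernel form $\psi_{cl}(x) = \sqrt{\kappa}\sum_i a_i e^{-\kappa|x-x_i|}$, the $N$ conditions $2\lambda a_i\sum_j a_j e^{-\kappa|x_i-x_j|} = 1$, and the normalization $\sum_{ij} a_i a_j(1+\kappa|x_i-x_j|)e^{-\kappa|x_i-x_j|} = 1$. The function names, the damped Newton iteration, and the test data are illustrative assumptions; this is not the fast algorithm referred to in the text.

```python
import numpy as np

def classical_amplitude(x, kappa, tol=1e-10, max_iter=200):
    """Return (a, lam) for data x and inverse width kappa.

    Substituting b = sqrt(2*lam)*a decouples lam: b solves b_i*(E b)_i = 1,
    the stationarity condition of the convex function
    f(b) = b.E.b/2 - sum(log b), minimized here by damped Newton steps.
    """
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])
    E = np.exp(-kappa * d)

    def f(b):
        return 0.5 * b @ E @ b - np.log(b).sum()

    b = 1.0 / np.sqrt(E.sum(axis=1))            # positive starting guess
    for _ in range(max_iter):
        grad = E @ b - 1.0 / b
        if np.max(np.abs(b * grad)) < tol:      # residual of b_i (E b)_i = 1
            break
        step = np.linalg.solve(E + np.diag(1.0 / b**2), -grad)
        t = 1.0
        # Halve the step until b stays positive and f does not increase.
        while t > 1e-12 and (np.any(b + t * step <= 0) or f(b + t * step) > f(b)):
            t *= 0.5
        if np.all(b + t * step > 0):
            b = b + t * step
    lam = 0.5 * b @ ((1.0 + kappa * d) * E) @ b  # fixed by the normalization condition
    return b / np.sqrt(2.0 * lam), lam

def psi_cl(grid, x, a, kappa):
    """Evaluate sqrt(kappa) * sum_i a_i exp(-kappa|grid - x_i|) on a grid."""
    return np.sqrt(kappa) * (a * np.exp(-kappa * np.abs(grid[:, None] - x[None, :]))).sum(axis=1)

# Hypothetical four-point data set used only to exercise the solver.
x_demo = np.array([-1.0, 0.0, 0.5, 2.0])
a_demo, lam_demo = classical_amplitude(x_demo, kappa=2.0)
```

The substitution $b_i = \sqrt{2\lambda}\,a_i$ is a design choice of this sketch: it reduces the $N+1$ coupled equations to a convex problem for $b$ alone, after which $\lambda$ follows from the normalization in closed form.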

Using the equations of motion, Eqs. (5) and (6), the classical action $S[\psi_{cl}]$ may be written

$$S[\psi_{cl}] = N - \lambda - \sum_i \ln Q_{cl}(x_i). \tag{10}$$

For the proper choice of $\kappa$ one might hope that $Q_{cl} \approx Q$, the true distribution. Since the
data points are drawn from the true distribution $Q(x)$, we expect

$$\sum_i \delta(x-x_i) \approx N\,Q(x). \tag{11}$$

Therefore, the last term of Eq. (10) is approximately $-N\int dx\,Q(x)\ln Q(x)$, which can be
interpreted as the entropy (or the information [?]). Using perturbation theory one may
show that when $Q_{cl} \approx Q$, then $\lambda \approx N$, so the first two terms of Eq. (10) (the penalty for
gradients) approximately cancel (more precisely, increase much less rapidly than $N$).
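The reduction of the classical action by the equation of motion takes two short steps. As a sketch, assume the action has the form $S[\psi] = \int dx\,\frac{\ell^2}{2}(\partial_x\psi)^2 - 2\sum_i\ln\psi(x_i)$ and that $\psi_{cl}$ obeys $-\ell^2\partial_x^2\psi_{cl} + 2\lambda\psi_{cl} = (2/\psi_{cl})\sum_i\delta(x-x_i)$ with $\int dx\,\psi_{cl}^2 = 1$; both forms are assumptions consistent with the surrounding text. Multiplying the equation of motion by $\psi_{cl}$, integrating, and integrating the first term by parts gives:

```latex
\begin{align}
0 &= \int dx\,\psi_{cl}\left[-\ell^2\partial_x^2\psi_{cl} + 2\lambda\psi_{cl}
      - \frac{2}{\psi_{cl}}\sum_i\delta(x-x_i)\right] \\
  &= \ell^2\int dx\,(\partial_x\psi_{cl})^2 + 2\lambda - 2N, \\
S[\psi_{cl}] &= \frac{\ell^2}{2}\int dx\,(\partial_x\psi_{cl})^2 - \sum_i\ln\psi_{cl}^2(x_i)
   = N - \lambda - \sum_i\ln Q_{cl}(x_i).
\end{align}
```

The second line also shows that $\lambda = N - \frac{\ell^2}{2}\int dx\,(\partial_x\psi_{cl})^2$, so $\lambda$ approaches $N$ whenever the gradient penalty is small compared with $N$.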

How does one choose $\kappa$? In Figure 1, the classical action is plotted against $\ln\kappa$ for data
sets generated from a gaussian distribution. One sees that, over a region of width $\ln N$,
$S[\psi_{cl}]$ is insensitive to the precise choice of $\kappa$. Therefore, $\kappa$ may be chosen by finding the
point of minimum sensitivity $|dS[\psi_{cl}]/d\ln\kappa|$ [?].

Once $\kappa$ has been chosen, the maximum likelihood distribution $Q_{cl}(x) = \psi_{cl}^2(x)$ is uniquely
determined. An example of results from this procedure is shown in Figure 2. One sees
convergence towards the underlying distribution as $N$ increases. Note that even for $N = 20$
the estimate $Q_{cl}$ is illuminating; the advantages of this method over binning are especially
great for small data sets.

FIG. 2. The classical distribution for data drawn randomly from a gaussian distribution
(solid line). Dashed curve, $N = 2000$; dotted curve, $N = 20$.

While $Q_{cl}$ represents the most likely distribution, other "nearby" distributions should
also be considered. The action may be expanded around the classical amplitude, which to
second order in the fluctuations $\delta\psi$ yields

$$S[\psi_{cl}+\delta\psi] \approx S[\psi_{cl}] + \frac{1}{4}\chi^2 + \int dx\left(\frac{\ell^2}{2}(\partial_x\delta\psi)^2 + \lambda\,\delta\psi^2\right), \tag{12}$$

where

$$\chi^2 \equiv 4\sum_i\left(\frac{\delta\psi(x_i)}{\psi_{cl}(x_i)}\right)^2 \tag{13}$$

is a measure of the goodness of fit between a trial distribution $\tilde{Q} = \tilde{\psi}^2$ and the data.
It is the direct analogue of the conventional $\chi^2$ (which here will be called $\chi_1^2$); to see this,
re-write

$$\chi^2 = 4\int dx\,\frac{\delta\psi^2}{\psi_{cl}^2}\sum_i\delta(x-x_i) \approx 4N\int dx\left(\sqrt{\tilde{Q}}-\sqrt{Q}\right)^2, \tag{14}$$

using Eq. (11). Now suppose that $\tilde{Q}$ and $Q$ are close, $\tilde{Q}(x) = Q(x) + \epsilon(x)$. Then we may
expand the difference of square roots as

$$\chi^2 \approx N\int dx\,\frac{\epsilon^2}{Q}, \tag{15}$$

which establishes the connection to the traditional definition.

This definition of $\chi^2$ has a number of advantages over $\chi_1^2$. Because of the quadratic
dependence on $\epsilon$ and the $Q$ term in the denominator, $\chi_1^2$ is quite sensitive to the tails of
distributions. In contrast, $\chi^2$ as defined in Eq. (13) is linear in $|\epsilon|$ when $|\epsilon|$ is
large, and has no potentially small term in the denominator. Therefore, this definition
is more robust than $\chi_1^2$. Another advantage is that binning is unnecessary. This eliminates
the problems of lost information and arbitrary bin-sizes and -boundaries (and simplifies
the process of fitting, as one need not worry about shifting bin-boundaries). Finally, this
definition of $\chi^2$ is essentially symmetric (exactly so in Eq. (14)), and consequently its square
root is a true metric on the space of probability distributions. (The form in Eq. (14) is known as the
squared Hellinger distance [?].)
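The two regimes can be checked numerically. The sketch below assumes the amplitude-based measure takes the squared-Hellinger form $4N\int dx\,(\sqrt{\tilde{Q}}-\sqrt{Q})^2$ and the conventional one the form $N\int dx\,(\tilde{Q}-Q)^2/Q$. The unit-variance gaussians, the shifts, the grid, and the function names are illustrative choices, not taken from the text.

```python
import numpy as np

def gaussian(x, mean):
    """Unit-variance gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def chi2_amplitude(q_trial, q_ref, dx, n):
    """4N * integral of (sqrt(q_trial) - sqrt(q_ref))^2 (squared-Hellinger form)."""
    return 4.0 * n * np.sum((np.sqrt(q_trial) - np.sqrt(q_ref)) ** 2) * dx

def chi2_conventional(q_trial, q_ref, dx, n):
    """N * integral of (q_trial - q_ref)^2 / q_ref (conventional form)."""
    return n * np.sum((q_trial - q_ref) ** 2 / q_ref) * dx

x = np.linspace(-15.0, 20.0, 350001)   # wide grid so both integrands are fully covered
dx = x[1] - x[0]
n = 1000                               # hypothetical number of data points
q = gaussian(x, 0.0)

for shift in (0.01, 3.0):              # a nearby and a distant trial distribution
    q_trial = gaussian(x, shift)
    print(shift, chi2_amplitude(q_trial, q, dx, n), chi2_conventional(q_trial, q, dx, n))
```

For the small shift the two measures agree closely, as the expansion in $\epsilon$ suggests. For the large shift the amplitude-based measure stays below its upper bound of $8N$, while the conventional one is dominated by the tail where $Q$ is small.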

How is $\chi^2$ distributed? To lowest order, the likelihood of any particular fluctuation $\eta \equiv \delta\psi$ is

$$P[\eta] \propto \exp\left[-\frac{1}{4}\chi^2 - \int dx\left(\frac{\ell^2}{2}(\partial_x\eta)^2 + \lambda\,\eta^2\right)\right]\delta\left(\int dx\,\psi_{cl}\,\eta\right). \tag{16}$$

The distribution $P(\chi^2)$ may in principle be calculated by integrating Eq. (16) over all $\eta$ with
fixed $\chi^2$; a realizable alternative is to calculate its Laplace transform, $P(a) = \langle e^{-a\chi^2}\rangle$,
where the expectation is relative to the distribution of $\eta$ in Eq. (16).

One challenge in evaluating any integral over $\eta$ is the "orthogonality condition"
$\delta(\int dx\,\psi_{cl}\,\eta)$ in Eq. (16). One way to handle this condition is to use the delta-function
representation $\delta(y) = \lim_{\epsilon\to 0^+}\frac{1}{\sqrt{\pi\epsilon}}e^{-y^2/\epsilon}$. This adds a term $(\int dx\,\psi_{cl}\,\eta)^2/\epsilon$ to the
argument of the exponential; the path integral may then be expressed formally in terms of
$\det(L + \psi_{cl}\otimes\psi_{cl}/\epsilon)^{-1/2}$, where $L$ is the appropriate operator (arising from the action,
Eq. (12)) and $\psi_{cl}\otimes\psi_{cl}$ is the matrix with the $(x,x')$ element equal to $\psi_{cl}(x)\psi_{cl}(x')$. The
non-local terms proportional to $1/\epsilon$ are large and must be handled first. We know that
$\lim_{\epsilon\to 0^+}\epsilon\det(L + \psi_{cl}\otimes\psi_{cl}/\epsilon)$ must be finite, so all the terms diverging worse than $1/\epsilon$ in the
determinant must vanish. (This happens because of the all-order singularity of the matrix
$\psi_{cl}\otimes\psi_{cl}$.) So even though $1/\epsilon$ is large, we may evaluate this determinant exactly by working
to first order in $1/\epsilon$. Therefore

$$\det\left(L + \frac{1}{\epsilon}\psi_{cl}\otimes\psi_{cl}\right) = \det L\,\det\left(1 + \frac{1}{\epsilon}L^{-1}\psi_{cl}\otimes\psi_{cl}\right) = \det L\left(1 + \frac{1}{\epsilon}\mathrm{Tr}\left(L^{-1}\psi_{cl}\otimes\psi_{cl}\right)\right). \tag{17}$$

Now we can take the limit $\epsilon\to 0^+$; the integral over all $\eta$ is now complete. The distribution
of $\chi^2$ (properly normalized) is therefore
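The statement that a rank-one term can be handled exactly at first order is the matrix determinant lemma, $\det(L + uu^{T}/\epsilon) = \det L\,(1 + u^{T}L^{-1}u/\epsilon)$. The sketch below checks it on a small random matrix. The 6-by-6 positive-definite matrix and the random unit vector are hypothetical finite-dimensional stand-ins for the operator $L$ and the amplitude $\psi_{cl}$, not a discretization of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
L = A @ A.T + n * np.eye(n)      # hypothetical positive-definite stand-in for the operator L
psi = rng.normal(size=n)
psi /= np.linalg.norm(psi)       # discrete analogue of a normalized amplitude

def det_with_constraint(eps):
    """Left-hand side: det(L + psi (x) psi / eps)."""
    return np.linalg.det(L + np.outer(psi, psi) / eps)

def det_first_order(eps):
    """Right-hand side: det(L) * (1 + psi . L^{-1} psi / eps)."""
    return np.linalg.det(L) * (1.0 + psi @ np.linalg.solve(L, psi) / eps)

for eps in (1.0, 1e-3, 1e-6):
    print(eps, det_with_constraint(eps), det_first_order(eps))
```

The two columns agree for every value of `eps`, and `eps * det_with_constraint(eps)` tends to the finite value `det(L) * psi . L^{-1} psi` as `eps` shrinks, which is the finite limit invoked in the text.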
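
$$P(a) = \left[\frac{D(\gamma)\,T(\gamma)}{D(1)\,T(1)}\right]^{-1/2}, \tag{18}$$

where $\gamma = 4a + 1$,

$$D(\gamma) = \frac{\det\left(-\ell^2\partial_x^2 + 2\lambda + \frac{2\gamma}{Q_{cl}}\sum_i\delta(x-x_i)\right)}{\det\left(-\ell^2\partial_x^2 + 2\lambda\right)}, \tag{19}$$

$$T(\gamma) = \int dx\,dx'\,K_\gamma(x,x')\,\psi_{cl}(x)\,\psi_{cl}(x'), \tag{20}$$

and the propagator $K_\gamma = L^{-1}$ satisfies

$$-\ell^2\partial_x^2 K_\gamma + 2\lambda K_\gamma + \frac{2\gamma}{Q_{cl}}\sum_i\delta(x-x_i)\,K_\gamma = \delta(x-x'). \tag{21}$$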

The terms of Eq. (18) can be evaluated exactly. First, consider the ratio of the
determinants, Eq. (19). Standard techniques [?] allow one to express $D(\gamma)$ as the limit as $x \to \infty$
of the function $E(x;\gamma)$, where $E$ satisfies

$$-\partial_x^2 E - 2\kappa\,\partial_x E + \frac{2\gamma}{\ell^2 Q_{cl}}\sum_i\delta(x-x_i)\,E = 0 \tag{22}$$

and $E(x) = 1$ for $x$ smaller than the smallest data point. Between data points,
$E(x) = E_i + F_i\,e^{-2\kappa(x-x_i)}$, and a short calculation shows that $E_i$ and $F_i$ satisfy a simple recursion
relation.

The traces $T(\gamma)$ are computed as follows: let $g_\gamma(x) = \int dx'\,K_\gamma(x,x')\,\psi_{cl}(x')$ and
$g_0(x) = \int dx'\,K_0(x,x')\,\psi_{cl}(x')$. $g_\gamma$ may be parametrized as

$$g_\gamma(x) = g_0(x) + \frac{\sqrt{\kappa}}{4\lambda}\sum_i c_i\,e^{-\kappa|x-x_i|}, \tag{23}$$

and from Eq. (21) the $c_i$ satisfy the linear equations

$$c_i + u_i\sum_j\left[c_j + (1+\kappa|x_i-x_j|)\,a_j\right]e^{-\kappa|x_i-x_j|} = 0, \tag{24}$$

where $u_i = \gamma/(\ell^2\kappa\,Q_{cl}(x_i))$. Then $T(\gamma)$ may be expressed in terms of the $c_i$ by computing the
remaining integral over $x$ (which may be done analytically).

This completes the evaluation of the distribution of $\chi^2$. One sees that different data
sets yield different $P(\chi^2)$. Therefore, it may be illustrative to consider the limit of large $N$,
where the distribution of $\chi^2$ assumes a more universal form.

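In the limit of large $N$, we may put $Q_{cl} \approx Q$ and $\lambda \approx N$. We write $\chi^2$ in a form
similar to Eq. (14), but introduce a small but necessary change: $\chi^2 \approx 4N\int_{\mathcal{X}} dx\,\delta\psi^2$ where,
heuristically, $\mathcal{X}$ is the region over which we may expect to find data points. We need only
the size $X$ of $\mathcal{X}$, which may be defined as $X = \sum_i 1/(N Q_{cl}(x_i))$. The determinant operator is
$\ell^2(-\partial_x^2 + \kappa^2)$ outside $\mathcal{X}$, and $\ell^2(-\partial_x^2 + \kappa^2(1+\gamma))$ inside $\mathcal{X}$. Then the ratio of determinants
(ignoring all but the exponential-order terms) is $D(\gamma) \approx e^{\kappa X(\sqrt{1+\gamma}-1)}$. The traces do not
contribute to the exponential-order terms. Consequently,

$$P(a) \approx e^{-\langle\chi^2\rangle\left(\sqrt{1+2a}-1\right)}, \tag{25}$$

where $\langle\chi^2\rangle \approx \kappa X/\sqrt{2}$. Note that if we identify $1/\kappa$ as the effective bin width, then $\langle\chi^2\rangle$ is
approximately $1/\sqrt{2}$ per bin, i.e., $\approx 0.7$ per degree of freedom. We may invert the Laplace
transform in Eq. (25) to obtain

$$P(\chi^2) \approx \frac{\langle\chi^2\rangle}{\sqrt{2\pi\chi^6}}\exp\left[-\frac{\left(\chi^2-\langle\chi^2\rangle\right)^2}{2\chi^2}\right]. \tag{26}$$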
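The inversion can be checked numerically. A Laplace transform of the form $e^{-m(\sqrt{1+2a}-1)}$ is that of the inverse-gaussian density $\frac{m}{\sqrt{2\pi t^3}}e^{-(t-m)^2/(2t)}$ with mean $m$, where $t$ stands for $\chi^2$ and $m$ for $\langle\chi^2\rangle$. The sketch below compares a direct numerical integral with the closed form; the value $m = 5$, the integration grid, and the function names are illustrative assumptions.

```python
import numpy as np

def chi2_density(chi2, mean):
    """Inverse-gaussian density: mean/sqrt(2 pi chi2^3) * exp(-(chi2-mean)^2 / (2 chi2))."""
    return mean / np.sqrt(2.0 * np.pi * chi2**3) * np.exp(-((chi2 - mean) ** 2) / (2.0 * chi2))

def laplace_numeric(a, mean, upper=400.0, n_points=400001):
    """Riemann-sum estimate of the integral of exp(-a*chi2) * density over (0, upper]."""
    chi2 = np.linspace(upper / n_points, upper, n_points)
    dchi2 = chi2[1] - chi2[0]
    return np.sum(np.exp(-a * chi2) * chi2_density(chi2, mean)) * dchi2

def laplace_closed_form(a, mean):
    """Closed form exp(-mean * (sqrt(1 + 2a) - 1))."""
    return np.exp(-mean * (np.sqrt(1.0 + 2.0 * a) - 1.0))

mean_chi2 = 5.0   # hypothetical value of <chi^2>
for a in (0.0, 0.5, 2.0):
    print(a, laplace_numeric(a, mean_chi2), laplace_closed_form(a, mean_chi2))
```

At $a = 0$ the integral is the normalization of the density, so agreement there also confirms that the distribution is properly normalized.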



The conventional approach to statistics emphasizes the model: given a model, one
calculates the likelihood of obtaining a particular data set. This likelihood is measured by the
conventional $\chi_1^2$. Its distribution is over (hypothetical) repeated trials of the experiment,
assuming gaussian errors. In contrast, the approach presented here emphasizes the data:
given a data set, one calculates the likelihood that it is described by a particular model.
This likelihood is measured by $\chi^2$; its distribution is over all possible models.

The approach presented here has two major advantages over conventional methods. First,
it provides a technique for visualizing data sets, retaining all the information in the data and
requiring no arbitrary choices. Second, it provides a robust measure of goodness-of-fit. Its
distribution can be calculated, and so may be used for statistical analysis. The availability
of a fast algorithm [?] makes computation time negligible even for large data sets. This
technique should be generalizable to higher dimensions [?].